Background

Premise engages citizens to crowdsource data in their own communities, thereby illuminating essential local points of interest. Depending on the geographic region, critical facilities may not be easily discoverable with modern search engines or mapping services. The current case study is from a recent campaign in Mexico City, where contributors were asked to find and document pharmacies. Contributor submissions were analyzed to approximate the unique locations of pharmacies and to understand surrounding features through text extraction. This study should serve as a proof of concept, or prototype, of applied statistical methods for generating insights from campaign-driven crowdsourced data and images.

Objective

The project can be divided into two primary objectives: to extract meaning from images and text submitted by contributors, and to harmonize location data from geotags to devise a list of unique pharmacy locations. The desired output is a corpus of texts derived from the images, insights that can be gleaned from them, and a final list of unique pharmacy locations with a corresponding confidence index.

Data

Two forms of data were provided: a directory of 898 images submitted by a total of 233 participants, and a CSV file with the same number of rows, each corresponding to exactly one of the images. The CSV data combined metadata such as GPS location and timestamp with form fields filled in by the participants themselves. The form fields asked questions such as how often users visited the pharmacy, whether they were confident in its quality, their opinion about the safety of the neighborhood, and so on. More importantly, users filled out the name field, which proved critical to identifying specific pharmacies and differentiating between pharmacies with different names that were clustered together.

 [1] "X1"                                                                                                        
 [2] "campaign_id"                                                                                               
 [3] "project_id"                                                                                                
 [4] "form_id"                                                                                                   
 [5] "task_id"                                                                                                   
 [6] "sub_id"                                                                                                    
 [7] "user_id"                                                                                                   
 [8] "timestamp"                                                                                                 
 [9] "photo"                                                                                                     
[10] "lat"                                                                                                       
[11] "lon"                                                                                                       
[12] "campaign_name"                                                                                             
[13] "task_title"                                                                                                
[14] "how safe do you feel in this area?"                                                                        
[15] "how confident are you in the quality of this pharmacy?"                                                    
[16] "have you ever visited this pharmacy?"                                                                      
[17] "please do not submit hospitals, clinics, or non-pharmacy healthcare entities."                             
[18] "is this pharmacy currently open?"                                                                          
[19] "have you ever gotten medication or medical supplies at this pharmacy"                                      
[20] "a pharmacy is a business or vendor where a pharmacist sells prescription and non-prescription medications."
[21] "name"                                                                                                      
[22] "how often do you visit this pharmacy?"                                                                     
[23] "cluster"                                                                                                   

Observations

Initially, a map with geotagged locations and their associated images as pop-ups was created in Mapbox for exploratory analysis (link). After exploring the map and images, some important observations were made:

  • Photos of the same pharmacy are often taken from different angles (see Fig 1)
  • In some cases (see Fig 2), photos are ostensibly taken from the same position but there is high variation in GPS location
  • Pharmacies of various kinds sometimes cluster together (see Fig 3)
  • A small proportion of all photos were of poor quality (see Fig 4): blurry, taken in low light conditions, or taken indoors

FIGURE 1. Unnamed pharmacy
FIGURE 2. Farmacia del Dr. Ahorro

FIGURE 3. Cluster of pharmacies
FIGURE 4. Poor quality photo


To quantify the number of blurred images, I used the OpenCV package in Python to compute the variance of the Laplacian 1, a standard method to detect blurring. This revealed that about 8% of the images were blurred, which, as we will see, has a direct impact on the quality of the text that can be extracted.
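As a rough illustration of the blur measure, the sketch below computes the variance of a 4-neighbour Laplacian in plain Python. It is not the code used in this study: with OpenCV the same idea is commonly written as `cv2.Laplacian(gray, cv2.CV_64F).var()`, and the names `laplacian_variance`, `is_blurry` and the cutoff `BLUR_THRESHOLD` are assumptions made here for illustration. The threshold actually used is not stated in this report.

```python
from statistics import pvariance

BLUR_THRESHOLD = 100.0  # hypothetical cutoff; would need tuning on real photos


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over interior pixels.

    `gray` is a list of equal-length rows of grayscale intensities.
    Border pixels are skipped, so values differ slightly from OpenCV's.
    """
    height, width = len(gray), len(gray[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            responses.append(
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    return pvariance(responses)


def is_blurry(gray, threshold=BLUR_THRESHOLD):
    """Few sharp edges means a low Laplacian variance, i.e. a blurry image."""
    return laplacian_variance(gray) < threshold


# Toy images: a high-contrast checkerboard and a featureless grey patch.
sharp = [[255 * ((x + y) % 2) for x in range(8)] for y in range(8)]
flat = [[128] * 8 for _ in range(8)]
```

Here `is_blurry(flat)` is true because a uniform patch has no edges, while the checkerboard produces large Laplacian responses and is classed as sharp.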

Other assumptions

This study assumes that GPS accuracy is within the normal range (~5 m) and that participants are not using fake GPS or spoofing to mask their real location when taking photos.

Methodology

The methodology consisted of five main components: text extraction, spatial clustering, text clustering, text matching, and building a confidence index. Text detection and extraction were performed in Python using the EAST and OCR deep learning models, and spatial clustering was performed with DBSCAN. Text manipulation, matching, and subsequent clustering were performed in R. A custom confidence index was then designed to describe the total strength of the evidence associated with each pharmacy location, together with an explanation of how the statistic changes. The new and old locations were then plotted on a map.


FIGURE 5. Workflow diagram



EAST

The first step in text extraction is text detection, or locating text in an image. For this study, EAST - An Efficient and Accurate Scene Text Detector - was used with an existing PyTorch implementation 2. This is a robust deep learning method for text detection that performs well on unstructured text. First, images were loaded and preprocessed with OpenCV, and a pre-trained model was configured to bound all text identified in the images. Bounding boxes were then used to crop the images into their textual fragments, and each of these fragments was passed to Tesseract, the OCR engine. The modified source code can be found in the author's GitHub repository.

FIGURE 6. Green bounding boxes indicate the text identified by EAST

OCR

Once bounding boxes were identified, Tesseract 3 - an open source OCR engine - was configured in Python to recognize text from the cropped bounding boxes for each image. The Spanish language pack was applied for this use case, and each text box was treated as a single text line.

import pytesseract
pytesseract.image_to_string(cropped_image, config='--tessdata-dir tessdata --psm 7', lang="spa")


Clustering

Spatial clustering was performed first. Then, for points within those spatial clusters, partitioning around medoids (PAM) was applied to the Levenshtein distances between pharmacy names.

DBSCAN

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is an unsupervised machine learning algorithm that groups points that are close to each other based on a distance measurement (here, Euclidean distance) and a minimum number of points. Its most important feature is that it does not require the number of clusters to be specified a priori; instead, it iteratively joins neighborhoods of radius eps (epsilon). Three input parameters are required:

  • eps: two points are considered neighbors if the distance between the two points is below the threshold eps

  • min_samples: the minimum number of neighbors a given point must have in order to be classified as a core point (clusters have a minimum size of 2)

  • metric: the metric used when calculating distance between instances (i.e. Euclidean distance)

For this study, 100 m was selected heuristically as the optimal radius. In our case, it would be worse to underestimate eps than to overestimate it, because there is still a second, text-based clustering step ahead that can pare down our results if there are too many pharmacies in a cluster. If the initial spatial clusters are too small, then we risk starting off assuming there are more clusters than there really are. 100 m is a sensible limit because it is very difficult to get a clear photo of a sign or storefront beyond that range using a mobile phone.
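To make the roles of eps and min_samples concrete, here is a small dependency-free sketch of DBSCAN over latitude/longitude pairs. It is only an illustration under stated assumptions, not the code used in this study: the function names and the toy coordinates are hypothetical, and it measures distance with the haversine formula in metres (an assumption made here so that eps can be given directly as 100 m) rather than the Euclidean metric mentioned above.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000


def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def dbscan(points, eps_m=100.0, min_samples=2):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    n = len(points)
    # Each neighbourhood includes the point itself.
    neighbours = [
        [j for j in range(n) if haversine_m(points[i], points[j]) <= eps_m]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # noise for now; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point: keep growing
    return labels


# Hypothetical submissions: three photos about 33 m apart, one far away.
photos = [
    (19.4326, -99.1332),
    (19.4329, -99.1332),
    (19.4332, -99.1332),
    (19.4500, -99.1000),
]
labels = dbscan(photos, eps_m=100.0, min_samples=2)
```

With these toy points the first three photos share one cluster label and the distant photo is labelled -1 (noise), which shows why a generous eps merges nearby submissions before the text-based step splits them again.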

FIGURE 7. Four distinct clusters in a neighborhood


It is important here to note that DBSCAN only takes us halfway to our objective - it is useful for identifying spatial clusters of photographs but we know from our observations (see Figure 3) that multiple pharmacies can also be spatially clustered, so how do we separate them? After spatial clustering, we need to try to find any existing clusters within clusters using pharmacy names. Following the next sections on text matching, the second part of clustering will be described in the section on PAM (partitioning around medoids).

Text matching

Text from all fields - OCR output and user input - was cleaned by forcing to lowercase and by removing excess whitespace, punctuation, and Spanish stopwords 4.

Levenshtein Distance

To answer the very basic yet important question - what is the name of the pharmacy? - we used Levenshtein Distance (LD) 5. LD is a measure of the similarity between two strings - the distance is an integer and represents the number of deletions, insertions, or substitutions (each one of these operations is counted as 1) required to transform the source string into the target string. The LD method is often used to correct spelling mistakes, which is why it was chosen as a string distance measure for this study. On one hand, the name input field was prone to human spelling error, and on the other, text output from OCR was very often incomplete, missing letters and including misrecognized letters. The LD method was applied to both.
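The definition above can be sketched with the usual dynamic-programming recurrence. This is a minimal illustration in Python, not the implementation used in this study, and the function name `levenshtein` is chosen here for convenience.

```python
def levenshtein(source, target):
    """Number of single-character insertions, deletions and substitutions
    needed to turn `source` into `target` (each operation costs 1)."""
    # previous[j] holds the distance between the processed prefix of
    # `source` and the first j characters of `target`.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,                       # delete s_char
                current[j - 1] + 1,                    # insert t_char
                previous[j - 1] + (s_char != t_char),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("farmcia", "farmacia"))   # one missing letter
print(levenshtein("kitten", "sitting"))     # classic textbook pair
```

Only two rows of the table are kept at a time, so memory grows with the length of the target string rather than with the product of both lengths.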


LD was used to detect common misspellings and to differentiate the word farmacia from the spellings of proper pharmacy names. A distance matrix was created for all of the individual words in the name column (entries of multiple words were split and each word extracted). A visual inspection of the distance matrix shows that if the distance is lower than or equal to 3, the word farmacia or a close variation of it is present. We can also see that certain words like "farmastar" and "farmacia" have a low distance of 3, but we know that the former is the proper name of the pharmacy.

stringdist('farmastar', 'farmacia', method='lv') 
[1] 3
name column    LV distance
farmacia       0
farmcia        1
farmacias      1
frarmacia      1
farmamia       1
farmacity      2
farmcias       2
famacias       2
fatmacias      2
frmacias       2
farmaúnica     3
farma          3
farmastar      3
farmasimi      3
farmaluz       3
farmafw        3
farmamigo      3
farmafe        3
famcias        3
farmalyn       3

The purple colored records above show low distance words that actually represent proper pharmacy names.

Since we do not want to remove these words as they contain important information, we cannot simply use an arbitrary distance threshold such as 3. One solution is to manually compose a list of common misspellings from the distance matrix. This list was then used to mask the name column and derive the proper names of pharmacies (see below).

 [1] "farmacia"  "farmcia"   "frarmacia" "farmacias" "farmcias"  "farmcias" 
 [7] "farmamia"  "frmacias"  "famacias"  "famcias"   "fatmacias"
original              minus_stop_words      proper_name
farmacia de dios      farmacia dios         dios
farmacias similares   farmacias similares   similares
farmacia urupan 2     farmacia urupan 2     urupan 2
farmacias similares   farmacias similares   similares
farmapronto           farmapronto           farmapronto

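
The masking step that produces the proper_name column can be sketched as follows. This is an illustrative Python version, not the R code used in the study: `MISSPELLINGS` copies the manually composed list shown above, while `STOPWORDS` is a small hypothetical subset of Spanish stopwords and `proper_name` is a name chosen here. Punctuation removal is left out for brevity.

```python
# Manually curated variants of the generic word, taken from the list above.
MISSPELLINGS = {
    "farmacia", "farmcia", "frarmacia", "farmacias", "farmcias",
    "farmamia", "frmacias", "famacias", "famcias", "fatmacias",
}

# Hypothetical subset of Spanish stopwords, for illustration only.
STOPWORDS = {"de", "del", "la", "el", "las", "los"}


def proper_name(raw_name):
    """Lowercase the entry, then drop stopwords and generic 'farmacia' variants."""
    tokens = raw_name.lower().split()
    kept = [t for t in tokens if t not in STOPWORDS and t not in MISSPELLINGS]
    return " ".join(kept)


for entry in ["farmacia de dios", "farmacias similares",
              "farmacia urupan 2", "farmapronto"]:
    print(entry, "->", proper_name(entry))
```

Because the mask is an explicit word list rather than a distance threshold, names such as farmapronto survive intact even though they sit close to farmacia in edit distance.
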
PAM

Clustering within spatial clusters is the final step in locating pharmacies. DBSCAN was a good start, but it needs to be taken one step further. PAM, or partitioning around medoids, is a classical partitioning technique 6 that clusters a dataset of n objects into k clusters. After creating a Levenshtein distance matrix for each unique spatial cluster, we can apply PAM to cluster pharmacy names within these clusters.

The example below shows how PAM was able to split one 14-point cluster into two - a 12-point and a 2-point cluster - based solely on pharmacy name. If pharmacies had blank names (this occurred in only 3% of submissions) and fell into a spatial cluster with named pharmacies, then they were assumed to belong to that cluster. In the case of multiple pharmacy clusters, a blank name would be assigned to the cluster whose spatial mean was closest to its location. In the case of a cluster of blank names, the final location name is an NA value.
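The idea of running PAM on a Levenshtein distance matrix can be sketched in a few lines of Python. This is an illustration under assumptions, not the R code used in the study: the function names, the toy list of names, and the choice k = 2 are all hypothetical, and how k was selected for each spatial cluster is not described in this report.

```python
def levenshtein(source, target):
    """Edit distance with unit cost for insert, delete and substitute."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (s_char != t_char)))
        previous = current
    return previous[-1]


def pam(dist, k):
    """Partition items around k medoids given a square distance matrix.

    Returns (medoid indices, cluster label per item)."""
    n = len(dist)

    def cost(medoids):
        # Total distance from every item to its nearest medoid.
        return sum(min(dist[i][m] for m in medoids) for i in range(n))

    # BUILD: greedily add the medoid that lowers the total cost the most.
    medoids = []
    while len(medoids) < k:
        candidates = [c for c in range(n) if c not in medoids]
        medoids.append(min(candidates, key=lambda c: cost(medoids + [c])))

    # SWAP: apply the best medoid/non-medoid exchange while it lowers the cost.
    while True:
        swaps = ([c if x == m else x for x in medoids]
                 for m in medoids for c in range(n) if c not in medoids)
        best = min(swaps, key=cost, default=medoids)
        if cost(best) >= cost(medoids):
            break
        medoids = best

    labels = [min(range(k), key=lambda idx: dist[i][medoids[idx]])
              for i in range(n)]
    return medoids, labels


# Hypothetical names from one spatial cluster (already masked and cleaned).
names = ["similares", "similares", "similares", "simlares", "ahorro", "ahoro"]
dist = [[levenshtein(a, b) for b in names] for a in names]
medoids, labels = pam(dist, k=2)
```

In this toy case the four "similares" variants end up in one group and the two "ahorro" variants in the other, mirroring how a single spatial cluster can be split by name alone.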


Confidence index

Location accuracy

Without ground truth measurements, it is not possible to build a predictive model and identify the most meaningful variables (number of unique user submissions, presence of correctly matching text in the photo, etc.) for locating pharmacies or other points of interest from crowdsourced data. However, one can take a heuristic approach based on the Central Limit Theorem 7 and assume that the larger the sample size (individual contributions), the more likely the sample mean (in this case the sample spatial mean) is to approximate the population mean. Although in this study the population is an abstract concept (it could be represented, for example, by the total number of residents in Mexico City submitting photos of one pharmacy), the concept of a distribution makes sense. As a general rule, sample sizes equal to or greater than 30 are sufficient for the Central Limit Theorem to hold. Based on this concept, a simple method can be devised to rate our confidence in location accuracy, which is undoubtedly a function of sample size.

Given a theoretical Pharmacy A, every incremental participant that documents the location of Pharmacy A after the first one is valuable up to a certain extent. Two unique users verifying the location of Pharmacy A is far better than one, and three is significantly better than two. Each incremental user's verification is slightly less valuable than the previous one; in other words, a new user "contributes" much more to the confidence score when the sample size is low. This can be represented mathematically as a function with a limit, where the score will always be between 0 and 1, approaching but never reaching 1:

\[f(x) = 1 - \frac{1}{\sqrt{x}}, \qquad \lim_{x\to\infty} f(x) = 1\]

Thus, a sample size of 1 will achieve a score of 0, a sample size of 2 will achieve 0.29, 3 will achieve 0.42, 4 will achieve 0.5, and so on. Of course, this model should be tested with ground truth data, because there may be a better one that fits (in reality, fewer than 30 samples may be needed to generate high confidence, in which case the model can be replaced with one that converges faster to 1).
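The scores quoted above can be reproduced with a few lines of Python. This is a small check of the arithmetic rather than part of the original pipeline; the function name `confidence_index` and the guard against n < 1 are choices made here.

```python
from math import sqrt


def confidence_index(n):
    """Score in [0, 1) for n >= 1 unique users: 1 - 1/sqrt(n)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1 - 1 / sqrt(n)


for n in (1, 2, 3, 4, 30):
    print(n, round(confidence_index(n), 2))
# 1 0.0
# 2 0.29
# 3 0.42
# 4 0.5
# 30 0.82
```

Note that even at the rule-of-thumb sample size of 30 the score is only about 0.82, which illustrates how slowly this particular curve approaches 1.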

Point-of-interest (POI) matching accuracy

POI matching should not have any bearing on location accuracy. It is apparent from the dataset that photos submitted at highly variable locations often show the exact same object (pharmacy) from the same angle. This means that we should avoid triangulating distances using the image and instead focus on the substance of the text. Is the object in the image actually the point-of-interest that we are interested in? We have used OCR to extract text from the images, and the LD method to determine (1) whether or not the image text matches farmacia and, where possible, (2) whether the image text matches the name of the pharmacy input by the user.

Each photo can therefore have between 0 and 2 text matches, which should be treated using the same cumulative, diminishing-returns logic as location accuracy. The more matches, the higher our confidence that a particular point-of-interest is what participants say it is. Given the difficulty of accurately extracting text from natural scenes, if one submission produces two matches, each match should contribute to n, and the same limit function should be applied to derive a POI score.

library(ggplot2)

# Confidence index: 0 at x = 1, approaching 1 as x grows
indexFun <- function(x) {
  1 - 1 / sqrt(x)
}

# Plot the index for 1 to 30 unique users / matches
p9 <- ggplot(data.frame(x = c(1, 30)), aes(x = x)) +
  stat_function(fun = indexFun, colour = "dodgerblue3", size = 1.5) +
  ggtitle("Confidence index of locations and POI type") +
  xlab("Unique users / matches") + ylab("Confidence Index CI")

p9
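The POI scoring rule described above, where every text match from every photo at a location adds to n, can be sketched in Python as follows. This is a hypothetical illustration: the function name `poi_score` is chosen here, and returning 0 when there are no matches at all is an assumption, since the index is undefined at n = 0.

```python
from math import sqrt


def poi_score(matches_per_photo):
    """POI confidence for one location.

    `matches_per_photo` holds 0, 1 or 2 for each photo: one possible match
    for the generic word and one for the user-supplied name.
    """
    n = sum(matches_per_photo)
    if n == 0:
        return 0.0  # assumption: no matching text means no confidence
    return 1 - 1 / sqrt(n)


print(poi_score([0, 0]))   # no text matched in either photo
print(poi_score([1]))      # a single match still scores 0 under this index
print(poi_score([2, 2]))   # two photos, both matches found in each
```

One consequence of reusing the same index is that a location needs at least two matches in total before its POI score rises above zero.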

Results

Point-of-interest matching produced positive results, with over 50% of photos containing text approximating farmacia, and over 50% of photos containing text that matched the name field of the submission. However, because the number of unique users submitting photos for each location was relatively low, the location accuracy confidence score was low.

Below, a final map can be seen with most likely locations of pharmacies based on the data. The number of points was reduced from 898 to 510, or by 43%.

library(sp)
library(leaflet)

# Palette for new (clustered) vs old (raw) points; not used by the markers below
pal <- colorFactor(c("navy", "red"), domain = c("new", "old"))

final_rx$mean_lat <- as.numeric(final_rx$mean_lat)
final_rx$mean_lon <- as.numeric(final_rx$mean_lon)

# Final (clustered) pharmacy locations
xy <- final_rx[, c(2, 3)]

spdf <- SpatialPointsDataFrame(coords = xy, data = final_rx,
                               proj4string = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"))

# Raw submitted locations
rxy <- raw[, c(6, 5)]

rspdf <- SpatialPointsDataFrame(coords = rxy, data = raw,
                                proj4string = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"))

# Base map centred on Mexico City
m <- leaflet() %>% setView(lng = -99.1332, lat = 19.4326, zoom = 14)

# Final locations as markers, raw submissions as small red circles;
# ~name reads the popup text from the name column of each data set
m %>% addProviderTiles(providers$CartoDB.Positron) %>%
  addMarkers(
    data = spdf,
    popup = ~name
  ) %>%
  addCircleMarkers(
    data = rspdf,
    popup = ~name,
    radius = 3,
    color = "red",
    stroke = FALSE, fillOpacity = 0.5
  )

What other insights did we get from OCR? The texts also provided information about other services and points of interest located in and around the pharmacies. For example, the word medico (eng. 'doctor') came up nearly 100 times and consultorio ('office') came up over 50 times. Also frequently present were words like recargas ('phone top up') and servicio ('service').

Overall, our location accuracy and point-of-interest confidence scores were low, with means of 0.03 and 0.28, respectively. These numbers are not as indicative of bad accuracy as they are of too low a sample size. Locations with low index scores are not necessarily wrong, but the scores are a signal that we need to incentivize more participants to make more submissions.

Limitations

Small sample sizes were the main limitation - the mean number of unique users per final location was only 1.12, which equates to less accuracy in the spatial means of clustered points. Furthermore, even with the latest technology, OCR is difficult to apply successfully to images in the wild, which means that it is not advisable to place too much importance on extracted text. In light of this, the current study relied heavily on the names submitted by the participants themselves (the more unique participants submitting the same names at the same locations, the better).

Alternative methods

  • Experiment with verifying locations with Google Street View images in the near vicinity if it is cost-effective
  • Incentivize more participants to obtain a larger number of samples per pharmacy
  • Hire a local fixer - someone that can manually verify locations from the output of this pipeline.

Recommendations

More research should be done into industry-grade OCR pipelines for text extraction. Participants should be incentivized to take more photos to increase n, and there should be a designated team member collecting ground truth data to help train future models.